Power and throughput optimization of an unbalanced instruction pipeline

ABSTRACT

A method includes determining a rate of resource occupancy of a constituent stage of an unbalanced instruction pipeline implemented in a processor through profiling an instruction code. The method also includes performing data processing at a maximum throughput at an optimum clock frequency based on the rate of resource occupancy.

FIELD OF TECHNOLOGY

This application is a Divisional of prior application Ser. No. 13/089,101, filed Apr. 18, 2011, currently pending, and also claims priority from Indian Provisional Application Serial No. 1129/CHE/2010, filed on Apr. 20, 2010, entitled “POWER AND THROUGHPUT OPTIMIZATION OF AN UNBALANCED INSTRUCTION PIPELINE”, which is incorporated herein by reference in its entirety.

Embodiments of the disclosure relate to instruction pipelining in processors.

BACKGROUND

Instruction pipelining is a technique used in processors (e.g., microprocessors, microcontrollers) to allow for parallel processing of instructions. For example, one instruction is associated with a first stage of an instruction pipeline and another instruction is associated with a second stage of the instruction pipeline. The instruction pipeline allows for “breaking” of the timing associated with a large data path, and provides parallelism in executing the instructions at an increased clock frequency.

The instruction pipeline offers optimum performance only when the constituent stages are perfectly balanced. A balanced pipeline implies that processing associated with a constituent stage of the pipeline takes a completion time equal to the completion time associated with all other constituent stage(s) of the instruction pipeline. However, there are scenarios (e.g., hard macro(s) such as memory/memories being in the data path of the pipeline, Arithmetic Logic Units (ALU units) such as multipliers, adders, bit shifters and dividers being in a same constituent stage of the pipeline) where a programmer/user is not able to perfectly balance the instruction pipeline. Here, the maximum frequency at which the unbalanced pipeline is clocked is determined through the constituent stage therein offering the maximum delay.

Assuming no stalls in an unbalanced instruction pipeline, the maximum frequency at which the unbalanced instruction pipeline is clocked is expressed in example Equation (1) as:

$\begin{matrix}{{f_{\max} = \frac{1}{d}},} & (1)\end{matrix}$

where d is the maximum delay offered by a constituent stage.

Assuming the time taken for executing N instructions to be $(N + n_s)$ cycles ($n_s$ being the number of constituent stages of the unbalanced instruction pipeline), the effective throughput, E, is expressed in example Equation (2) as:

$\begin{matrix}{E = {{f_{\max} \cdot \frac{N}{\left( {N + n_{s}} \right)}} \sim f_{\max}}} & (2)\end{matrix}$

The throughput, E, as seen in Equation (2), is the number of instructions per second. Increased throughput is associated with a higher $f_{\max}$, which implies a lower maximum delay offered by the constituent stage of the unbalanced instruction pipeline.
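As a rough numerical illustration of Equations (1) and (2), the following Python sketch computes the maximum frequency and the effective throughput. The function names and the sample values (a 15 ns worst-case stage delay, one million instructions, four stages) are assumptions chosen only for illustration.

```python
def max_frequency_hz(max_stage_delay_s: float) -> float:
    """Equation (1): f_max = 1 / d, with d the largest stage delay in seconds."""
    return 1.0 / max_stage_delay_s


def effective_throughput(f_max_hz: float, n_instructions: int, n_stages: int) -> float:
    """Equation (2): E = f_max * N / (N + n_s), in instructions per second."""
    return f_max_hz * n_instructions / (n_instructions + n_stages)


# Illustrative values (assumptions, not taken from the equations themselves).
f_max = max_frequency_hz(15e-9)
e = effective_throughput(f_max, n_instructions=1_000_000, n_stages=4)

print(f"f_max = {f_max / 1e6:.1f} MHz")
print(f"E / f_max = {e / f_max:.6f}")
```

For large N the ratio E / f_max approaches 1, which is why Equation (2) treats E as approximately equal to f_max.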

The pipeline can be clocked at a frequency higher than that computed based on the max-delay, and when the usage of the timing-path involving the max-delay is detected, the pipeline can be stalled for a number of cycles equivalent to the delay offered by the timing-path. This is known as pipeline stalling.

With the above approach, the frequency might not be optimal, if the usage of the timing-path involving the max-delay is not frequent. It would lead to unnecessary dynamic power dissipation. Hence, there is a need to arrive at an optimum frequency for a given rate of usage of the timing-path involving the maximum delay.

SUMMARY

In one aspect, a method includes determining a rate of resource occupancy of a constituent stage of an unbalanced instruction pipeline implemented in a processor through profiling an instruction code. The method also includes performing data processing associated with the unbalanced instruction pipeline at a maximum throughput at an optimum clock frequency based on the rate of resource occupancy.

In another aspect, a method includes determining a time interval within a processing time associated with a constituent stage of an unbalanced instruction pipeline implemented in a processor based on a change in a processing scenario associated with data processing therein. The method also includes dynamically determining a rate of resource occupancy of the constituent stage periodically with a time period equal to the determined time interval through profiling an instruction code associated therewith. Further, the method includes periodically obtaining a clock frequency associated with the rate of resource occupancy of the constituent stage and performing the data processing at the periodically obtained clock frequency. The clock frequency corresponds to an optimized power consumption and/or a throughput associated with the unbalanced instruction pipeline.

In yet another aspect, a computing system includes a processor having an unbalanced instruction pipeline implemented therein and a memory configured to store an instruction code associated with processing through the unbalanced instruction pipeline. The computing system also includes a determination module configured to determine a rate of resource occupancy of a constituent stage of the unbalanced instruction pipeline through profiling the instruction code associated with processing through the unbalanced instruction pipeline. The processor is configured to perform data processing at a maximum throughput at an optimum clock frequency based on the rate of resource occupancy.

Other features will be apparent from the accompanying drawings and from the detailed description that follows.

BRIEF DESCRIPTION OF THE VIEWS OF DRAWINGS

FIG. 1 is a schematic view of a data path and a control path associated with an unbalanced instruction pipeline, according to one or more embodiments.

FIG. 2 is an illustrative view of an example processing scenario associated with the unbalanced instruction pipeline of FIG. 1.

FIG. 3 is a schematic view of logic associated with a pipeline control unit configured to dynamically profile an instruction code associated with a constituent stage of the unbalanced instruction pipeline of FIG. 1.

FIG. 4 is a plot of throughput associated with a constituent stage of the unbalanced instruction pipeline of FIG. 1 as a function of a clock frequency for different example values of the rate of resource occupancy associated with the constituent stage.

FIG. 5 is a schematic view of a computing system including a processor in which the unbalanced instruction pipeline of FIG. 1 is implemented.

FIG. 6 is a process flow diagram detailing the operations involved in a method of performing optimum data processing through the unbalanced instruction pipeline of FIG. 1, according to one or more embodiments.

FIG. 7 is a process flow diagram detailing the operations involved in a method of performing optimum and dynamic data processing through the unbalanced instruction pipeline of FIG. 1, according to one or more embodiments.

Other features of the present embodiments will be apparent from the accompanying drawings and from the detailed description that follows.

DETAILED DESCRIPTION

Disclosed are a method, an apparatus and/or a system to optimize power and throughput in an unbalanced instruction pipeline implemented in a processor associated therewith. Although the present embodiments have been described with reference to specific example embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the various embodiments.

FIG. 1 illustrates a data path 162 and a control path 164 associated with an unbalanced instruction pipeline 100, according to one or more embodiments. An instruction code associated with processing through unbalanced instruction pipeline 100 is stored in program memory 102. Program memory 102 is a Read-Only Memory (ROM). In some cases, a data memory (not shown) in the form of a Random Access Memory (RAM) is used to store intermediate results and variables associated with the processing. Program memory 102 may also be configured to store constants associated with the processing. Instructions stored in program memory 102 are decoded through instruction decoder 104, and matching control signals for the pipelined data path 162 are generated.

The aforementioned operations (e.g., instruction decoding) constitute stage 1 106 of unbalanced instruction pipeline 100. In the example embodiment shown in FIG. 1, unbalanced instruction pipeline 100 is shown to include four stages (e.g., stage 1 106, stage 2 108, stage 3 110, stage 4 112). Unbalanced instruction pipeline 100 may include more than four stages or fewer than four stages; the four stages shown in FIG. 1 merely serve as an example. In another example embodiment, stage 1 106 is associated with an instruction fetch operation, stage 2 108 is associated with an instruction decode operation, stage 3 110 is associated with an execute operation, stage 4 112 is associated with a memory access operation, and stage 5 (not shown) is associated with a write back operation.

Registers are inserted between stages of unbalanced instruction pipeline 100. Specifically, in one or more embodiments, the output of each stage is an input to a flip-flop (e.g., FF₁ 114, FF₂ 116, FF₃ 118, FF₄ 120). For example, as shown in FIG. 1, D flip-flops are used for the aforementioned purpose. Each D flip-flop is configured to receive the output of the previous stage (e.g., instruction decoder 104 output, output of D flip-flop (Q)) as the D input thereof. Each flip-flop (e.g., FF₁ 114, FF₂ 116, FF₃ 118, FF₄ 120) is clocked through a clock generation circuit (e.g., CLK GEN 1 132, CLK GEN 2 134, CLK GEN 3 136, CLK GEN 4 138). Program memory 102 also has a clock generation circuit (e.g., CLK GEN 0 130) associated therewith. In an example embodiment, the clock generation circuit includes a crystal oscillator. The clock generation circuits (e.g., CLK GEN 0 130, CLK GEN 1 132, CLK GEN 2 134, CLK GEN 3 136, CLK GEN 4 138) associated with the individual stages are controlled through pipeline control unit 150.

Unbalanced instruction pipeline 100 may include a data path 162 and a control path 164. As shown in FIG. 1, data path 162 may include flip-flops configured to latch onto and propagate data to succeeding stages. Control path 164 may include control elements (e.g., control element 1 142, control element 2 144, control element 3 146, control element 4 148) configured to control data processing through the stages of unbalanced instruction pipeline 100. For example, the control elements are configured to assert a signal to enable data transfer through data path 162 at an output. Flip-flops are used as control elements in unbalanced instruction pipeline 100. In one or more embodiments, pipeline control unit 150 is also configured to control clock gating (to be discussed below) and data forwarding through each stage of unbalanced instruction pipeline 100 using the decoded instruction control signals available through the control elements. Further, pipeline control unit 150 is configured to utilize the decoded instruction control signals from each stage of unbalanced instruction pipeline 100 to detect data hazards therein.

In the example embodiment of FIG. 1, stage 3 110 includes logic associated therewith. Specifically, FIG. 1 illustrates stage 3 110 as including logic 1 122, logic 2 124, and logic 3 126. Also, a multiplexer (MUX 128) may select one of logic 1 122, logic 2 124 and logic 3 126 based on a control signal. It is noted that there may be more logic units associated with stage 3 110. Logic 1 122, logic 2 124 and/or logic 3 126 may be Arithmetic Logic Units (ALU units) (e.g., multiplier, adder, bit shifter, divider). For the sake of convenience in understanding, it is assumed that logic 1 122 is a divider, logic 2 124 is an adder, and logic 3 126 is a multiplier, and that a task completion time associated with logic 1 122 is 15 nanoseconds (15 ns), a task completion time associated with logic 2 124 is 2 ns, and a task completion time associated with logic 3 126 is 5 ns. The task completion time associated with each of the other stages (e.g., stage 1 106, stage 2 108, stage 4 112) is assumed to be 2 ns.

Thus, the maximum delay associated with unbalanced instruction pipeline 100/stage 3 110 is 15 ns. Further, it is assumed that the probability of logic 1 122 being utilized during processing is lower than the probability associated with the use of logic 2 124 and logic 3 126. In other words, MUX 128 is configured to select logic 2 124 or logic 3 126 more frequently than logic 1 122. If unbalanced instruction pipeline 100 is clocked at a frequency associated with the maximum delay in stage 3 110 (e.g., 15 ns due to logic 1 122), the throughput (see, e.g., Equation (2)) associated with unbalanced instruction pipeline 100 is limited, as the clock frequency is limited (e.g., to a maximum of 66.7 MHz) even though the probability of use of logic 1 122 is low.

Thus, it is beneficial to clock unbalanced instruction pipeline 100 at a frequency higher than the example 66.7 MHz discussed above. For example, unbalanced instruction pipeline 100 may be clocked at a frequency associated with the smallest delay associated with any of the constituent stages (e.g., stage 1 106, stage 2 108, stage 3 110, stage 4 112). In the example scenario discussed above, the smallest delay associated with the stages is 2 ns. Therefore, unbalanced instruction pipeline 100 is clocked at a frequency associated with 2 ns (i.e., 500 MHz).

Whenever the use of logic 1 122 is required, the execution (or, task completion) associated with stage 3 110 and the previous stages thereof (e.g., stage 2 108, stage 1 106) is stalled for at least a number of clocks corresponding to the delay associated with logic 1 122 (e.g., 15 ns). The minimum number of 2 ns clocks required to cover 15 ns is 8. Thus, execution associated with logic 1 122, logic 2 124, and logic 3 126 of stage 3 110 is stalled for eight clock cycles, one clock cycle and three clock cycles, respectively. Stalling is accomplished through gating the clock inputs to the flip-flops associated with stage 3 110 (e.g., FF₃ 118) and the previous stages thereof (e.g., stage 2 108 and FF₂ 116, stage 1 106 and FF₁ 114). In one or more embodiments, new instructions are prevented from entering unbalanced instruction pipeline 100 during the stall.
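The stall counts quoted above follow from a ceiling division of each logic delay by the clock period. A minimal Python sketch of that arithmetic is given below; it assumes integer picosecond units to avoid floating-point rounding, and the function and variable names are illustrative rather than part of the disclosure.

```python
def stall_cycles(delay_ps: int, clock_period_ps: int) -> int:
    """Smallest whole number of clock periods that covers the given delay."""
    return -(-delay_ps // clock_period_ps)  # integer ceiling division


# Example delays from the description: 15 ns, 2 ns and 5 ns.
DELAYS_PS = {"logic 1": 15_000, "logic 2": 2_000, "logic 3": 5_000}
CLOCK_PERIOD_PS = 2_000  # 2 ns period, i.e., 500 MHz

cycles = {name: stall_cycles(d, CLOCK_PERIOD_PS) for name, d in DELAYS_PS.items()}
print(cycles)
```

With a 2 ns clock this reproduces the eight, one and three cycles stated above.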

Clock gating for the purpose of stalling is controlled by pipeline control unit 150 (to be described below). Clock gating is controlled through control elements (e.g., control element 3 146, control element 2 144, control element 1 142), in association with pipeline control unit 150. At the simplest level, an AND gate (not shown) is employed for the clock gating. Here, the signal(s) associated with the stages (e.g., stage 3 110, stage 2 108, stage 1 106) that are stalled are inverted and input to the AND gate. The clock signals generated from the clock generation circuits (e.g., CLK GEN 3 136, CLK GEN 2 134, CLK GEN 1 132, CLK GEN 0 130) may also be input to the AND gate. Whenever the signal(s) are high, the inverted input to the AND gate is low and the clock output of the AND gate is also low, regardless of the state of the clock inputs. Clock gating circuits are known to one skilled in the art, and, therefore, discussion of more examples thereof is skipped for the sake of convenience.
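The AND-gate behavior described above can be modeled in a few lines of Python. This is only a behavioral sketch under the assumption of ideal, glitch-free signals sampled once per half period; the names gated_clock, clk_wave and stall_wave are hypothetical. Practical designs commonly use a latch-based clock gating cell rather than a bare AND gate to avoid glitches.

```python
def gated_clock(clk: int, stall: int) -> int:
    """AND of the clock with the inverted stall signal (both are 0 or 1)."""
    return clk & (stall ^ 1)


# One sample per half clock period; the stall is asserted for two full periods.
clk_wave = [0, 1, 0, 1, 0, 1, 0, 1]
stall_wave = [0, 0, 1, 1, 1, 1, 0, 0]

gated = [gated_clock(c, s) for c, s in zip(clk_wave, stall_wave)]
print(gated)
```

While the stall signal is high, the gated output stays low, so the flip-flops fed by it see no clock edges and hold their state.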

In one embodiment, constituent stages of unbalanced instruction pipeline 100 include multi-cycle paths. Stage 3 110, for example, may include a multi-cycle path through logic 1 122. The multi-cycle path may require more than one clock cycle for completion of the task associated therewith. The task initiation is accomplished through a source flip-flop changing a state thereof, following which the result of the execution is transmitted to a destination flip-flop. The timing checks associated with the aforementioned stall process are part of, for example, the Static Timing Analysis (STA) utilized. Also, the multi-cycle path discussed above is defined during the STA by the programmer/user of a computing system executing tasks associated with unbalanced instruction pipeline 100.

If the probability of use of logic 1 122 for processing is high, the number of stalls increases for every instruction associated with the aforementioned processing. Thus, dynamic power consumption is impacted, as the dynamic power is proportional to the number of clock cycles. In addition, unbalanced instruction pipeline 100 has clock buffers, the constituent flip-flop(s) of which toggle at rising/falling edges of clock pulses. This may contribute to increased dynamic power consumption. Therefore, in the abovementioned example, it is preferable to clock unbalanced instruction pipeline 100 at a frequency lower than 500 MHz.

It is possible to determine the rate of resource occupancy associated with an instruction/a constituent stage of unbalanced instruction pipeline 100 through profiling an instruction code associated therewith. In the example described above, an instruction is associated with division, addition or multiplication. Consistent with the assumptions made above, logic 1 122 is associated with division operations, logic 2 124 is associated with addition operations, and logic 3 126 is associated with multiplication operations. The rate of use (i.e., resource occupancy) of logic 1 122, logic 2 124 and logic 3 126 is expressed in example Equation (3) as:

$\begin{matrix}{R_{1,2,3} = \frac{N_{{division},{addition},{multiplication}}}{N}} & (3)\end{matrix}$

where R₁, R₂ and R₃ are the rates of use of logic 1 122, logic 2 124 and logic 3 126 respectively, N is the number of instructions, and $N_{division}$, $N_{addition}$ and $N_{multiplication}$ are the numbers of division, addition and multiplication instructions respectively.
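Equation (3) amounts to counting instructions per operation type and dividing by the total instruction count. The Python sketch below illustrates this on a made-up trace; the mnemonics "div", "add" and "mul", the function name occupancy_rates and the trace itself are assumptions for illustration and do not come from any particular instruction set.

```python
from collections import Counter


def occupancy_rates(trace, ops=("div", "add", "mul")):
    """Return N_op / N for each operation of interest, per Equation (3)."""
    if not trace:
        raise ValueError("trace must contain at least one instruction")
    counts = Counter(trace)
    n = len(trace)  # N: total number of profiled instructions
    return {op: counts[op] / n for op in ops}


# Hypothetical profiled trace of ten instructions.
trace = ["add"] * 6 + ["mul"] * 3 + ["div"] * 1
rates = occupancy_rates(trace)
print(rates)
```

In this made-up trace the divider is used by one instruction in ten, which is the kind of low-occupancy case the description targets.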

As discussed above, R₁, R₂ and R₃ are obtained through profiling the instruction code associated with processing through stage 3 110 of unbalanced instruction pipeline 100. For example, compiling/executing the instruction code associated therewith yields R₁, R₂ and R₃. Also, the rate of resource occupancy may depend on a system level scenario in which the instruction code is executed. Thus, obtaining the rate of resource occupancy associated with a stage (or, a sub-stage) of unbalanced instruction pipeline 100 may include monitoring utilization of a processor/memory associated therewith. Parameters associated with the aforementioned monitoring may also include instruction cache (e.g., instruction cache associated with program memory 102) hits/misses and data cache (e.g., data cache associated with data memory) hits/misses.

The instruction cache and the data cache may, respectively, allow for increased speed of an instruction fetch process and a data fetch/store process. In order to monitor these parameters, performance counters (or, registers) are employed in the processing/operating environment associated with processing through unbalanced instruction pipeline 100. The performance counters (or, registers) are configured to keep track of the abovementioned processor/memory utilization and/or a number of instruction/data cache hits/misses. The number of stall cycles associated with a clock frequency (e.g., 500/66.7 MHz) is estimated through the delay (e.g., 2 ns/15 ns) associated with the stage of unbalanced instruction pipeline 100, as discussed above.

In certain scenarios, the rate vector <R> is constant throughout run-time. For example, the instruction code being executed may be associated with a reliability test of a product, which may take values of the same parameters that are approximately close to one another on different days and check for continued reliability. In such scenarios, an initial profiling of the instruction code may suffice to determine the rate vector <R>. The clock frequency and the number of stall cycles are kept constant for the instruction code. In other scenarios, the rate vector <R> may not be constant throughout run-time, and is updated dynamically, as will be discussed below.

FIG. 2 illustrates an example processing scenario, according to one or more embodiments. It is assumed that there is a processor in which unbalanced instruction pipeline 100 is implemented. The processor is configured to support video processing 202 for the first 10 seconds (s) of an operation, audio processing 204 for the next 20 seconds and, again, video processing 206 for the next 10 seconds. Video processing 206 is analogous to video processing 202. As discussed above, R₁ is associated with the rate of use of logic 1 122, R₂ is associated with the rate of use of logic 2 124, and R₃ is associated with the rate of use of logic 3 126. As video processing (202, 206) involves operations (e.g., mathematical operations) that are different from those of audio processing 204, the rate vector <R₁> (e.g., (R₁, R₂, R₃)) associated with video processing (202, 206) is different from the rate vector <R₂> (e.g., (R₁, R₂, R₃)) associated with audio processing 204, as shown in FIG. 2.

In the example shown in FIG. 2, the minimum occupancy time associated with audio/video processing (202, 204, 206) is 10 seconds. The minimum occupancy time is then sampled at, for example, every 1 s, which is the interval for estimating <R>. The pre-defined intervals for determining <R> are thus chosen based on the rate at which processing scenarios (e.g., audio processing 204, video processing (202, 206)) change for the processor.

Thus, <R> (e.g., <R₁>, <R₂>) is estimated at pre-defined intervals, depending on which the clock frequency and stall cycles are updated in the hardware associated with the processing. As shown in FIG. 2, video processing 202 involves a rate vector <R₁> for which clock frequency f₁ and the associated stall vector <s₁> (e.g., (s₁, s₂, s₃)) are obtained based on maximizing throughput. Here, s₁ denotes the number of stall cycles associated with logic 1 122, s₂ denotes the number of stall cycles associated with logic 2 124, and s₃ denotes the number of stall cycles associated with logic 3 126.
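One way to picture how a clock frequency and stall vector might be obtained for a given rate vector is sketched below in Python. The throughput model is an assumption made for illustration and is not stated in the description: each instruction using logic i is taken to occupy stage 3 for s_i = ceil(d_i / T) cycles of period T, so the average cost is the sum of R_i * s_i cycles per instruction, and hazards, pipeline fill and other stages are ignored. The candidate periods and all names are likewise hypothetical.

```python
from fractions import Fraction


def stall_vector(delays_ps, period_ps):
    """Cycles needed by each logic unit at the given clock period."""
    return tuple(-(-d // period_ps) for d in delays_ps)


def throughput_mips(rates, delays_ps, period_ps):
    """Assumed model: millions of instructions per second at this period."""
    avg_cycles = sum(r * c for r, c in zip(rates, stall_vector(delays_ps, period_ps)))
    # 1e6 ps per microsecond, so instructions/us equals millions per second.
    return Fraction(1_000_000) / (period_ps * avg_cycles)


def best_operating_point(rates, delays_ps, candidate_periods_ps):
    """Pick the candidate period with the highest modeled throughput."""
    best = max(candidate_periods_ps, key=lambda t: throughput_mips(rates, delays_ps, t))
    return best, stall_vector(delays_ps, best)


# Delays of logic 1, 2, 3 from the example (15 ns, 2 ns, 5 ns); rates are made up.
DELAYS_PS = (15_000, 2_000, 5_000)
RATES = (Fraction(1, 10), Fraction(6, 10), Fraction(3, 10))
CANDIDATE_PERIODS_PS = (15_000, 8_000, 5_000, 4_000, 3_000, 2_500, 2_000)

best_period, best_stalls = best_operating_point(RATES, DELAYS_PS, CANDIDATE_PERIODS_PS)
print(best_period, best_stalls)
```

Under these assumed rates the model favors a 2.5 ns period (400 MHz) over the 2 ns period (500 MHz), because the finer clock pays for more stall cycles than it gains in clock rate.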

At the end of the first 10 seconds, the clock frequency and the stall vector are updated in the hardware to f₂ and <s₂> (e.g., (s₁, s₂, s₃)) respectively to allow for an optimum (e.g., maximum) throughput during audio processing 204. The clock frequency and the stall vector continue to be f₂ and <s₂> for the next 20 seconds, although the associated rate vector <R₂> is still monitored for changes in the rates therein. At the end of the 20 seconds, the clock frequency and the stall vector are switched to f₁ and <s₁> as audio processing 204 switches to video processing 206. The aforementioned operations, including the calculation of <R>, are performed through pipeline control unit 150 having associated logic.

FIG. 3 illustrates logic associated with pipeline control unit 150 configured to dynamically profile the instruction code associated with stage 3 110 of unbalanced instruction pipeline 100, according to one or more embodiments. As shown in FIG. 3 and as discussed above, decoded instruction control signals (e.g., decoded instruction control (stage 3) 302 associated with stage 3 110) are input to pipeline control unit 150 (e.g., to the aforementioned logic associated with pipeline control unit 150). Counter 1 304, counter 2 306 and counter 3 308 are associated with computing a rate vector <R> associated with a processing scenario. A pre-defined interval for profiling an instruction code associated with the processing scenario is chosen analogous to the example discussed in FIG. 2. It is assumed that there are, on average, M instructions in the pre-defined interval.

A Look Up Table (LUT) 312 is implemented in the logic associated with pipeline control unit 150 to obtain the clock frequency and stall cycles (or, stall vectors) for different values of rate vector <R>. LUT 312 is implemented using a multiplexer having inputs to LUT 312 (e.g., <R>=<R₁>, <R₂>) as select lines thereof. The output of LUT 312 is the clock frequency (e.g., f=f₁, f₂) and/or the stall vector (e.g., <s>=<s₁>, <s₂>). At the end of every interval, the counters (e.g., counter 1 304, counter 2 306, counter 3 308) are reset through interval counter 310. Interval counter 310 is also configured to count the pre-defined intervals (e.g., interval period in FIG. 3). Implementations of interval counter 310 are known to one skilled in the art, and, therefore, discussion associated therewith is skipped for the sake of convenience.

To summarize, in one or more embodiments, at every interval, the hardware associated with processing through unbalanced instruction pipeline 100 is updated with a new frequency and a stall vector, if applicable, based on a change in the rate vector (e.g., <R₂>) when compared to the previous rate vector (e.g., <R₁>) associated with the previous interval. Then, the counters (e.g., counter 1 304, counter 2 306, counter 3 308) associated with computing <R> (e.g., <R₁>, <R₂>) are reset to begin the next averaging.
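The interval-based profiling summarized above can be sketched as a small behavioral model in Python. Several simplifications are assumed here and are not taken from the description: the interval is measured in instructions rather than in time, the rate vector is quantized to tenths to form the LUT key, and the LUT entries (frequencies in MHz and stall vectors) are hypothetical placeholders. The class and method names are likewise illustrative.

```python
class PipelineControlModel:
    """Behavioral sketch of per-unit counters, an interval counter and a LUT."""

    def __init__(self, lut, interval_len):
        self.lut = lut                    # quantized rate vector -> (MHz, stall vector)
        self.interval_len = interval_len  # instructions per profiling interval
        self.counters = [0, 0, 0]         # one counter per logic unit
        self.interval_count = 0
        self.current = None               # setting currently applied to hardware
        self.updates = []                 # log of settings written to hardware

    def observe(self, unit_index):
        """Record one decoded instruction that uses the given logic unit."""
        self.counters[unit_index] += 1
        self.interval_count += 1
        if self.interval_count == self.interval_len:
            self._end_of_interval()

    def _end_of_interval(self):
        # Quantize each rate to tenths so it can serve as a LUT key.
        key = tuple(round(10 * c / self.interval_len) for c in self.counters)
        setting = self.lut.get(key, self.current)  # unknown key: keep current setting
        if setting != self.current:                # update hardware only on a change
            self.current = setting
            self.updates.append(setting)
        self.counters = [0, 0, 0]                  # reset for the next averaging
        self.interval_count = 0


def feed(model, n1, n2, n3):
    """Feed one interval's worth of instructions for logic 1, 2 and 3."""
    for idx, n in enumerate((n1, n2, n3)):
        for _ in range(n):
            model.observe(idx)


# Hypothetical LUT entries for two processing scenarios.
LUT = {(1, 6, 3): (400, (6, 1, 2)), (0, 8, 2): (500, (8, 1, 3))}
model = PipelineControlModel(LUT, interval_len=10)
feed(model, 1, 6, 3)  # first scenario
feed(model, 1, 6, 3)  # same scenario again: no hardware update
feed(model, 0, 8, 2)  # scenario change
feed(model, 1, 6, 3)  # back to the first scenario
print(model.updates)
```

Only three hardware updates are logged for the four intervals, since the second interval yields the same rate vector as the first.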

FIG. 4 illustrates throughput 402 associated with a stage of unbalanced instruction pipeline 100 as a function of clock frequency, f 404, for different example values of the rate vector, <R> 406, according to one or more embodiments. As discussed above, the rate vector, <R> 406 (e.g., <R₁>, <R₂>), is determined from the compiled instruction code associated therewith. In one or more embodiments, the plot is obtained through a knowledge of stall vector, <s> 410 (e.g., <s₁>, <s₂>), associated with clock frequency, f 404 (e.g., f₁, f₂). Increasing clock frequency, f 404, beyond a certain value (e.g., f₁, f₂) is not required, as throughput 402 may saturate beyond that value. FIG. 4 also shows a table associating <R> 406, f 404, and <s> 410. Clock frequency, f 404, is configurable based on <R> 406. As seen in the discussion associated with FIG. 3, the output of LUT 312 may yield clock frequency, f 404. When a phase-locked loop (PLL) is used for generation of clock frequency, f 404, the PLL is programmed to select an appropriate frequency. The PLL is associated with a clock generation circuit (e.g., CLK GEN 3 136, CLK GEN 2 134, CLK GEN 1 132, CLK GEN 0 130) of a stage of unbalanced instruction pipeline 100.
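Because throughput may saturate with frequency, a lower clock frequency that gives nearly the same throughput can be preferable for dynamic power. The Python sketch below picks the lowest candidate frequency whose modeled throughput is within a tolerance of the peak. The throughput model (average cycles per instruction equal to the sum of R_i * ceil(d_i / T)), the tolerance parameter, the candidate periods and the rate values are all assumptions for illustration, not values from FIG. 4.

```python
from fractions import Fraction


def throughput(rates, delays_ps, period_ps):
    """Assumed model: instructions per microsecond at the given clock period."""
    cycles = [-(-d // period_ps) for d in delays_ps]  # ceiling division
    avg_cycles = sum(r * c for r, c in zip(rates, cycles))
    return Fraction(1_000_000) / (period_ps * avg_cycles)


def lowest_adequate_period(rates, delays_ps, periods_ps, tolerance=Fraction(5, 100)):
    """Longest period (lowest frequency) within `tolerance` of peak throughput."""
    peak = max(throughput(rates, delays_ps, t) for t in periods_ps)
    adequate = [t for t in periods_ps
                if throughput(rates, delays_ps, t) >= (1 - tolerance) * peak]
    return max(adequate)


DELAYS_PS = (15_000, 2_000, 5_000)  # 15 ns, 2 ns and 5 ns logic delays
RATES = (Fraction(1, 10), Fraction(6, 10), Fraction(3, 10))  # made-up rate vector
PERIODS_PS = (15_000, 8_000, 5_000, 4_000, 3_000, 2_500, 2_000)

tight = lowest_adequate_period(RATES, DELAYS_PS, PERIODS_PS)
relaxed = lowest_adequate_period(RATES, DELAYS_PS, PERIODS_PS, Fraction(15, 100))
print(tight, relaxed)
```

Loosening the tolerance from 5% to 15% moves the selected period from 2.5 ns to 3 ns in this made-up case, trading a small amount of modeled throughput for a lower clock frequency.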

FIG. 5 illustrates a computing system 500 including processor 502 in which unbalanced instruction pipeline 100 is implemented, according to one or more embodiments. Computing system 500 may be a personal computer, a laptop, a notebook computer and/or a system utilizing the benefits associated with optimized unbalanced instruction pipeline 100. Computing system 500 may also include a microcontroller with a processor 502. Computing system 500 includes a memory 504 (e.g., program memory 102) configured to store the instruction code associated with processing through unbalanced instruction pipeline 100. Computing system 500 also includes a determination module 506 configured to determine the rate of resource occupancy of a constituent stage of unbalanced instruction pipeline 100 through profiling the instruction code associated with processing through unbalanced instruction pipeline 100. Processor 502 is configured to perform processing associated with unbalanced instruction pipeline 100 at a clock frequency based on an optimum power consumption and/or a throughput associated with unbalanced instruction pipeline 100 for the determined rate of resource occupancy of the constituent stage.

FIG. 6 illustrates a process flow diagram detailing the operations involved in a method of performing optimum data processing through unbalanced instruction pipeline 100, according to one or more embodiments. Operation 602 involves determining a rate of resource occupancy of a constituent stage of unbalanced instruction pipeline 100 implemented in processor 502 through profiling an instruction code associated therewith. Operation 604 then involves performing data processing associated with unbalanced instruction pipeline 100 at a maximum throughput at an optimum clock frequency based on the rate of resource occupancy.

FIG. 7 illustrates a process flow diagram detailing the operations involved in a method of performing optimum and dynamic data processing through unbalanced instruction pipeline 100, according to one or more embodiments. Operation 702 involves determining a time interval within a processing time associated with a constituent stage of unbalanced instruction pipeline 100 implemented in processor 502 based on a change in a processing scenario associated with data processing. Operation 704 involves dynamically determining a rate of resource occupancy of the constituent stage periodically with a time period equal to the determined time interval through profiling an instruction code associated therewith. Operation 706 involves periodically obtaining a clock frequency associated with the rate of resource occupancy of the constituent stage. The clock frequency corresponds to optimized power consumption and/or a throughput associated with unbalanced instruction pipeline 100. Operation 708 then involves performing the data processing at the periodically obtained clock frequency.

Exemplary embodiments discussed above can be used in high-performance, low power computing applications. Specifically, the exemplary embodiments may be used in delay-locked loops (DLLs) associated with Global Positioning System (GPS) receivers and embedded/Digital Signal Processing (DSP) applications requiring large-scale processing. Other applications utilizing the concepts discussed herein are within the scope of the exemplary embodiments. Stage 3 110 of unbalanced instruction pipeline 100 may involve a hard macro (e.g., the data memory discussed above) therein. The divider logic, adder logic, and multiplier logic discussed above are merely for purposes of illustration. Modifications in the constituent elements of stages (e.g., increasing/decreasing the number of constituent elements, varying the constituent elements) of unbalanced instruction pipeline 100 are well within the scope of the exemplary embodiments. In one or more embodiments, it is possible that a constituent stage (e.g., stage 3 110) of unbalanced instruction pipeline 100 may include a single element, which may contribute to the maximum delay associated with unbalanced instruction pipeline 100. Optimization, as discussed above, may then be done based on the aforementioned single element.

Although the present embodiments have been described with reference to specific example embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the various embodiments. For example, the various systems, devices, apparatuses, and circuits, etc. described herein may be enabled and operated using hardware circuitry, firmware, software or any combination of hardware, firmware, or software embodied in a machine readable medium. The various electrical structures and methods may be embodied using transistors, logic gates, application specific integrated circuit (ASIC) circuitry or Digital Signal Processor (DSP) circuitry.

In addition, it will be appreciated that the various operations, processes, and methods disclosed herein may be embodied in a machine-readable medium or a machine accessible medium compatible with a data processing system, and may be performed in any order. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

What is claimed is:
 1. A method comprising: determining a rate ofresource occupancy of a constituent stage of an unbalanced instructionpipeline implemented in a processor through profiling an instructioncode; and performing data processing associated with the unbalancedinstruction pipeline at a maximum throughput and at an optimum clockfrequency based on the rate of resource occupancy.
 2. The method ofclaim 1, wherein performing the data processing includes stallingprocessing associated with at least one of the constituent stage of theunbalanced instruction pipeline and a previous stage, for at least anumber of clock cycles corresponding to a delay time associated with theprocessing, through the constituent stage by gating a clock input to theat least one of the constituent stage and the previous stage.
 3. The method of claim 1, further comprising: determining a time interval within a processing time associated with the constituent stage of the unbalanced instruction pipeline based on a change in a processing scenario associated with processing; dynamically determining the rate of resource occupancy of the constituent stage periodically with a time period equal to the determined time interval; and obtaining, at every time interval, the clock frequency associated with the rate of resource occupancy of the constituent stage for performing the data processing associated with the unbalanced instruction pipeline.
 4. The method of claim 3, wherein the clock frequency associated with the data processing is higher than a frequency corresponding to the higher delay time associated with the constituent stage.
 5. The method of claim 2, further comprising obtaining a number of stall cycles associated with stalling in at least one of the constituent stage of the unbalanced instruction pipeline and a previous stage thereof for at least the number of stall cycles, wherein the number of stall cycles corresponds to a delay time associated with the processing through the constituent stage.
 6. The method of claim 5, wherein determining the rate of resource occupancy of the constituent stage of the unbalanced instruction pipeline includes: inputting a control signal associated with a decoded instruction associated with the processing through the constituent stage to a counter associated therewith; determining the rate of resource occupancy of the constituent stage through the counter; and maintaining a Look Up Table (LUT) associated with the counter to map the determined rate of resource occupancy and at least one of the clock frequency and the number of stall cycles associated therewith.
 7. The method of claim 6, further comprising: updating hardware associated with the processing through the constituent stage with the at least one of the clock frequency and the number of stall cycles determined through the LUT when the at least one of the clock frequency and the number of stall cycles varies from a value thereof during a previous time interval; and resetting the counter at the end of the time interval.
 8. The method of claim 6, comprising implementing the LUT through a multiplexer having the rate of resource occupancy as an input and a select line.
 9. A method comprising: determining a time interval within a processing time associated with a constituent stage of an unbalanced instruction pipeline implemented in a processor based on a change in a processing scenario associated with data processing; dynamically determining a rate of resource occupancy of the constituent stage periodically with a time period equal to the time interval through profiling an instruction code; periodically obtaining a clock frequency associated with the rate of resource occupancy of the constituent stage, the clock frequency corresponding to an optimized at least one of a power consumption and a throughput associated with the unbalanced instruction pipeline; and performing the data processing at the periodically obtained clock frequency.
 10. The method of claim 9, further comprising obtaining a number of stall cycles associated with stalling processing in at least one of the constituent stage of the unbalanced instruction pipeline and a previous stage thereof for at least the number of stall cycles, wherein the number of stall cycles corresponds to a delay time associated with the processing through the constituent stage.
 11. The method of claim 9, wherein dynamically determining the rate of resource occupancy of the constituent stage includes: inputting a control signal associated with a decoded instruction associated with the processing through the constituent stage to a counter associated therewith; determining the rate of resource occupancy of the constituent stage through the counter; and maintaining a Look Up Table (LUT) associated with the counter to map the determined rate of resource occupancy and at least one of the clock frequency and the number of stall cycles associated therewith.
 12. The method of claim 11, further comprising: updating hardware associated with the processing through the constituent stage with the at least one of the clock frequency and the number of stall cycles determined through the LUT when the at least one of the clock frequency and the number of stall cycles varies from a value thereof during a previous time interval; and resetting the counter at the end of the time interval.
 13. The method of claim 11, comprising implementing the LUT through a multiplexer having the rate of resource occupancy as an input and a select line thereof.
 14. A computing system comprising: a processor having an unbalanced instruction pipeline; a memory configured to store an instruction code associated with processing through the unbalanced instruction pipeline; and a determination module configured to determine a rate of resource occupancy of a constituent stage of the unbalanced instruction pipeline through profiling the instruction code associated with processing through the unbalanced instruction pipeline, the processor being configured to perform data processing at a maximum throughput at an optimum clock frequency based on the rate of resource occupancy.
 15. The computing system of claim 14, further comprising a pipeline control unit configured to control a clock generation circuit associated with the constituent stage of the unbalanced instruction pipeline.
 16. The computing system of claim 15, wherein the pipeline control unit further comprises a Look Up Table (LUT) implemented therein configured to map the rate of resource occupancy of the constituent stage determined through the determination module to at least one of the clock frequency and a number of stall cycles, wherein the number of stall cycles is associated with stalling processing in at least one of the constituent stage of the unbalanced instruction pipeline and a previous stage thereof for at least the number of stall cycles, and wherein the number of stall cycles corresponds to a delay time associated with the processing through the constituent stage.